10/27/2019

Agenda

  • Polynomial regression
  • Step functions
  • Regression splines
  • Smoothing splines
  • Generalized additive models

Linear models: major cons

  • Linear models are pretty much always wrong; if the truth is nonlinear, then the linear model will provide a biased estimate
  • Linear models depend on which transformation of the data you use, e.g. \(x\) versus \(\log(x)\)
  • Linear models do not estimate interactions among the predictors unless you explicitly build them in

How to improve

  • Last week, we saw that we can improve upon OLS using ridge and lasso regression by reducing the complexity of the model, and hence the variance of the estimates
  • This week, we relax the linearity assumption while still attempting to maintain as much interpretability as possible

Polynomial regression

Adding non-linear terms

A simple approach for incorporating non-linear associations in a linear model is to include transformed versions of the predictors in the model, e.g.

\[\text{mpg} = \beta_0 + \beta_1 \text{horsepower} + \beta_2 \text{horsepower}^2 + \varepsilon\]

We are predicting mpg using a non-linear function of horsepower, but it is still a linear model with \(X_1 = \text{horsepower}\) and \(X_2 = \text{horsepower}^2\).

  • We can use standard linear regression software to estimate the parameters
  • Polynomial regression: extending the linear model to accommodate non-linear relationships
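
A minimal sketch in R, assuming the Auto data from the ISLR package (mpg and horsepower as in the equation above); lm() treats the squared term, wrapped in I(), as just another predictor:

library(ISLR)
fit <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(fit)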

Adding non-linear terms

\[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_d x_i^d + \varepsilon_i\]

  • Not really interested in the coefficients; more interested in the fitted function values at any value \(x_0\), i.e. \[\hat{f} (x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0 + \hat{\beta}_2 x_0^2 + \dots + \hat{\beta}_d x_0^d\]
  • Since \(\hat{f} (x_0)\) is a linear function of the \(\hat{\beta}\)’s, can get a simple expression for the pointwise variances \(\text{Var}[\hat{f} (x_0)]\) at any value \(x_0\)
  • We either fix the degree \(d\) at some reasonably low value, or use cross-validation to choose \(d\)
  • Can fit in R using y ~ poly(x, degree = d) in the model formula (see the sketch below)
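
A minimal sketch of choosing \(d\) by 10-fold cross-validation, again assuming the Auto data from the ISLR package; cv.glm() comes from the boot package, so a Gaussian glm() (equivalent to lm()) is used:

library(ISLR)
library(boot)
set.seed(1)
cv.errors <- rep(NA, 5)
for (d in 1:5) {
  fit <- glm(mpg ~ poly(horsepower, degree = d), data = Auto)
  cv.errors[d] <- cv.glm(Auto, fit, K = 10)$delta[1]  # estimated test MSE
}
which.min(cv.errors)  # pick the degree with the smallest CV error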

A note on confidence intervals

  • As noted above, \(\hat{f} (x_0)\) is a linear function of the \(\hat{\beta}\)’s, so we get a simple expression for the pointwise variance \(\text{Var}[\hat{f} (x_0)]\) at any value \(x_0\)

Can use predict(fit, newdata = ..., se = T) and then use the 2 standard deviation rule.
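
A minimal sketch, assuming a quadratic fit on the Auto data (ISLR package); pred$se.fit holds the pointwise standard errors:

library(ISLR)
fit <- lm(mpg ~ poly(horsepower, 2), data = Auto)
hp.grid <- seq(min(Auto$horsepower), max(Auto$horsepower), length.out = 100)
pred <- predict(fit, newdata = data.frame(horsepower = hp.grid), se = TRUE)
upper <- pred$fit + 2 * pred$se.fit  # pointwise ~95% confidence band
lower <- pred$fit - 2 * pred$se.fit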

Caveat: polynomials have notoriously bad tail behavior, making them very unreliable for extrapolation. This is because the polynomial is defined globally: it is fitted using all of the data, so the fit near the boundaries is influenced by points far away.


Step Functions

Another way of creating transformations of a variable: cut the variable into distinct regions.

In order to fit a step function we use the cut() function.

Given cutpoints \(c_1 = 92\), \(c_2 = 138\), \(c_3 = 184\) in the range of \(X\), we construct \(4\) dummy variables

\[ \begin{aligned} &C_1(X) = \mathbb{I}(X \leq 92) \\ &C_2(X) = \mathbb{I}(92 < X \leq 138) \\ &C_3(X) = \mathbb{I}(138 < X \leq 184) \\ &C_4(X) = \mathbb{I}(X > 184) \end{aligned}\]

We only use \(3\) of them (one is the baseline). In general, given \(K\) cutpoints there are \(K+1\) intervals and \(K\) dummy variables.
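
A minimal sketch with cut(), reusing the cutpoints above and assuming the Auto data from the ISLR package; lm() expands the resulting factor into the \(K = 3\) dummy variables automatically:

library(ISLR)
fit <- lm(mpg ~ cut(horsepower, breaks = c(-Inf, 92, 138, 184, Inf)), data = Auto)
summary(fit)  # intercept = mean of the first (baseline) interval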

Step Functions

  • Only uses some datapoints in the estimation of each \(\beta_j\)
  • Easy to work with: create a series of dummy variables representing each group
  • Useful way of creating interactions that are easy to interpret, e.g. \[\mathbb{I}(\text{horsepower} \leq 150) \times \text{Speed}, \quad \mathbb{I}(\text{horsepower} > 150) \times \text{Speed}\] would allow for a different linear function of Speed in each horsepower category (see the sketch after this list)
  • Choice of cutpoints or knots can be problematic. For creating nonlinearities, smoother alternatives such as splines are available
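
A hedged sketch of the interaction idea; Speed is a hypothetical variable (not in the standard Auto data), so toy data are simulated purely for illustration. Multiplying the indicator by Speed inside I() gives a separate slope on each side of the cutpoint:

set.seed(1)
dat <- data.frame(horsepower = runif(200, 50, 230), Speed = runif(200, 40, 120))
dat$mpg <- 40 - 0.1 * dat$horsepower + 0.05 * dat$Speed + rnorm(200)  # toy response
fit <- lm(mpg ~ I((horsepower <= 150) * Speed) + I((horsepower > 150) * Speed), data = dat)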

Splines

Spline approach

Let’s take the best of the two previous ideas:

  • smoothness and flexibility, from polynomial regression

  • local support, from step function approach

  • Instead of a single polynomial in \(X\) over its whole domain, we can rather use different polynomials in regions defined by knots, e.g. \[y_i = \begin{cases} \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \varepsilon_i, \quad &\text{if } x_i \leq c \\ \beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \varepsilon_i, \quad &\text{if } x_i > c \end{cases}\]

  • Better to add constraints to the polynomials, e.g. continuity

  • Splines have the “maximum” amount of continuity: all derivatives up to order \(d-1\) are continuous at the knots; requiring one more derivative would force the pieces to merge into a single global polynomial

Two distinct polynomials

Two distinct \(3^{rd}\) degree polynomials. How many degrees of freedom (parameters)?

Two distinct polynomials: continuity

Two distinct \(3^{rd}\) degree polynomials with continuity. How many degrees of freedom (parameters)?

Splines

How many degrees of freedom (parameters)? Splines impose continuity of all derivatives up to order \(d-1\) (here, up to the second derivative). With one knot, two unconstrained cubics use \(8\) parameters; the three continuity constraints (on \(f\), \(f'\), \(f''\)) leave \(8 - 3 = 5\), i.e. \(4 + K\) degrees of freedom in general.

Linear splines

A linear spline with knots at \(\xi_k, k = 1,2,\dots, K\) is a piecewise linear polynomial continuous at each knot. We can represent this model as \[y_i = \beta_0 + \beta_1 b_1(x_i)+ \beta_2 b_2(x_i)+ \dots + \beta_{K+1} b_{K+1}(x_i) + \varepsilon_i\]

where the \(b_k\) are basis functions (truncated power basis) \[\begin{aligned} &b_1(x_i) = x_i \\ &b_{k+1}(x_i)=(x_i - \xi_k)_{+}, \quad k = 1,2,\dots,K \end{aligned}\] Here the \((\dots)_{+}\) means positive part, i.e.

\[(x_i - \xi_k)_{+} = \begin{cases} x_i - \xi_k & \text{if } x_i > \xi_k \\ 0 & \text{otherwise} \end{cases}\]
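
A minimal sketch building the truncated power basis by hand, assuming the Auto data (ISLR package) and reusing the knot locations from the step-function slide:

library(ISLR)
x <- Auto$horsepower; y <- Auto$mpg
xi <- c(92, 138, 184)  # assumed knot locations
basis <- sapply(xi, function(k) pmax(x - k, 0))  # columns (x - xi_k)_+
fit <- lm(y ~ x + basis)  # piecewise linear, continuous at each knot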

Linear splines

Cubic splines

A cubic spline with knots at \(\xi_k, k = 1,2,\dots, K\) is a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot. We can represent this model as \[y_i = \beta_0 + \beta_1 b_1(x_i)+ \beta_2 b_2(x_i)+ \dots + \beta_{K+3} b_{K+3}(x_i) + \varepsilon_i\]

where the \(b_k\) are basis functions \[\begin{aligned} &b_1(x_i) = x_i \\ &b_2(x_i) = x_i^2 \\ &b_3(x_i) = x_i^3 \\ &b_{k+3}(x_i)=(x_i - \xi_k)^{3}_{+}, \quad k = 1,2,\dots,K \end{aligned}\]
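
The cubic analogue of the linear-spline sketch above (same assumed x, y, and knots):

basis3 <- sapply(xi, function(k) pmax(x - k, 0)^3)  # columns (x - xi_k)^3_+
fit3 <- lm(y ~ x + I(x^2) + I(x^3) + basis3)  # continuous f, f', f'' at the knots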

Cubic splines

Equivalent representation

  • Since the space of spline functions of a particular order and knot sequence is a vector space, there are many equivalent bases for representing them (just as there are for ordinary polynomials)
  • While the truncated power basis is conceptually simple, it is not too attractive numerically: powers of large numbers can lead to severe rounding problems
  • In practice, we often use another basis: the B-spline basis, which allows for efficient computations even when the number of knots \(K\) is large (each basis function has a local support)

\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 b_1(x_i)+ \hat{\beta}_2 b_2(x_i)+ \dots + \hat{\beta}_{K+3} b_{K+3}(x_i)\]

In R, we can simply fit a “linear” model on the spline basis; bs() comes from the splines package:

library(splines)
fit <- lm(Y2 ~ bs(X, df = 10, degree = 3))

By default, bs() places the knots at quantiles of X; with df = 10 and degree = 3 this corresponds to \(K = 10 - 3 = 7\) interior knots.

Linear B-splines

Cubic B-splines

Example of fit

\[\hat{f}(x) = \hat{\beta}_0 + \hat{\beta}_1 b_1(x)+ \hat{\beta}_2 b_2(x)+ \dots + \hat{\beta}_{K+3} b_{K+3}(x)\]


Question time